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Abstract 

Background: This paper introduces a new constrained nnodel and tlie corresponding algoritlim, called unsupervised 
Bayesian linear unmixing (uBLU), to identify biological signatures fronn high dimensional assays like gene expression 
microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture 
of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that 
specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is 
that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over 
the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted 
here to generate random samples according to the posterior distribution of the factors, factor scores, and number of 
factors. These samples are then used to estimate all the unknown parameters. 

Results: Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and 
compared with previous factor decomposition methods, such as principal component analysis (PGA), non negative 
matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general 
matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression 
dataset from a recent viral challenge study in which individuals have been inoculated with influenza 
A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated 
and real data sets considered here. 

Conclusions: The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method 
when compared to other factor decomposition methods from the literature (PGA, NMF, BFRM, and GB-GMF). The 
uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during 
the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor. 



Background samples is much less than the number G of genes. For 
Factor analysis methods such as principal component example, in an Affymetrix HU133 gene chip, the num- 
analysis (PCA) have been widely studied and can be used ber G of genes may range from ten to twenty thousand 
for discovering the patterns of differential expression in depending on the type of chip description file (CDF) pro- 
time course and/or multiple treatment biological experi- cessing used to translate the oligonucleotide fragments to 
ments using gene or protein microarray samples. These gene labels whereas we only analyze about a hundred of 
methods aim at finding a decomposition of the observa- samples. 

tion matrix Y e R^^^ whose rows (respectively columns) This decomposition expresses each of the N samples as 

are indexed by gene index (respectively sample index). a particular linear combination of R characteristic gene 

Typically, in gene expression analysis, the number N of expression signatures, also referred to as factors, with 

appropriate proportions (or contributions), called factor 

^ , TV WTr . , TT. 7 WTr scores, following a linear mixing model 
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Y = MA + N (1) 
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where M g R^^^ represents the factor loading matrix, 
A G R^''^ the factor score matrix and N G R^''^ 
is a matrix containing noise samples. Each sample y/ 
(/ = 1, . . . , AT), corresponding to the i-th. column of the 
observed gene expression matrix Y, is a vector of G gene 
expression levels that can be expressed as 

R 

Yi = ^ mr^r,i + Hi = Mai + ni (2) 

r=l 

where is the r-th column of M, ar,i denotes the (r, 0-th 
element of the matrix A, a/ and are the i-th column of A 
and N respectively. The number of factors R in the decom- 
position is usually much less than the number of sam- 
ples N. Traditional factor analysis methods such as PCA 
require the experimenter to specify the desired number 
of factors to be estimated. However, some recent Bayesian 
factor analysis methods are totally unsupervised in the 
sense that the number of factors is directly estimated from 
the data [1-3]. 

The model (1) is identical to the standard factor anal- 
ysis model [4] for which the columns of M are called 
factors and should correspond to biological signatures (or 
pathways). Note that the elements of the matrix M are 
referred to as factor loadings, and the columns of A are the 
factor scores. Approaches to fitting the model (1) to data 
include principal component analysis [5,6], least squares 
matrix factorization [7,8], finite mixture modeling [9,10], 
and Bayesian factor analysis [4,11,12]. 

This paper presents a new Bayesian factor analysis 
method called unsupervised Bayesian linear unmixing 
(uBLU), that estimates the number of factors and incorpo- 
rates non-negativity constraints on the factors and factor 
scores, as well as a sum-to-one constraint for the factor 
scores. The uBLU method presented here differs from the 
BLU method, developed in [13] for hyperspectral imag- 
ing and applied to gene microarray expression analysis 
in [14]. Note that BLU requires user specification of the 
number of factors while uBLU determines the number of 
factors using Bayesian birth-death model. The positivity 
and sum-to-one constraints are natural in gene microar- 
ray analysis when the entries of the observation matrix are 
non-negative and when a proportional decomposition is 
desired. Thus each factor score corresponds to the con- 
centration (or proportion) of a particular factor to a given 
sample. The advantage of this representation for gene 
expression analysis is twofold: i) the factor scores corre- 
spond to the strengths of each gene contributing to that 
factor; ii) for each gene chip the factor scores give the rel- 
ative abundance of each factor present in the chip. For 
example, a gene having a large loading level (close to one) 
for a particular factor should have a small loading (close to 
zero) for all other factors. In this way, as opposed to other 
factor analysis methods, there is less multiplexing making 



it easier to associate specific genes to specific factors and 
vice versa. 

A similar approach, based on NMR spectral imaging 
and called the Bayesian decomposition (BD), has been 
previously developed by Moloshok et al and applied to 
gene expression data [11]. More recently, the coordi- 
nated gene activity in pattern sets method (CoGAPS), 
available as an open R-source [12], combines the GAPS- 
MCMC matrix factorization algorithm with a threshold- 
independent statistic to infer activity in specific gene 
sets. However, these approaches require cold restarts of 
the algorithm with different number of factors and with 
different random seeds to avoid the large number of 
local minima encountered in the process of fitting the 
matrix factorization model MA to the data Y. In contrast, 
the proposed uBLU algorithm uses a judicious model to 
reduce sensitivity to local minima rather than using cold 
restarts. The novelty of the uBLU model is that it con- 
sists of: (1) a birth-death process to infer the number 
of factors; (2) a positivity constraint on the loading and 
score matrices M, A to restrict the space of solutions; (3) 
a sum-to-one constraint on the columns of A to further 
restrict the solution space. The uBLU model is justified 
for non-negative data problems like gene expression anal- 
ysis and produces an estimate of the non-negative factors 
in addition to their proportional representation in each 
sample. 

Bayesian linear unmixing, traditionally used for hyper- 
spectral image analysis (see [13] for example), is one 
of many possible factor analysis methods that could 
be applied to gene expression analysis. These methods 
include non-negative matrix factorization (NMF) [7,8], 
independent component analysis (ICA) [15], Bayesian 
decomposition (BD) [11], PCA [5], bi-clustering [16], 
penalized matrix decomposition (PMD) [2], Bayesian fac- 
tor regression modeling (BFRM) [1], and more recently 
the gradient-based algorithm of Nikulin et al for general 
matrix factorization (GB-GMF) [17]. Contrary to uBLU, 
the PCA, ICA, BFRM, GB-GMF, bi-clustering and PMD 
methods do not account for non-negativity of the factor 
loadings and factor scores. On the other hand, NMF does 
not account for sum-to-one constraints on the columns of 
the factor score matrix. Contrary to PCA and ICA, uBLU 
does not impose orthogonality or independence on the 
factors, as well as the GB-GMF algorithm. These relaxed 
assumptions might better represent what is known about 
the preponderance of overlap and dependency in bio- 
logical pathways. Finally, uBLU naturally accommodates 
Bayesian estimation of the number of factors, like BFRM. 
Note that BFRM has been specifically developed for gene 
expression data [1]. 

In this paper we provide comparative studies that estab- 
lish quantitative performance advantages of the proposed 
constrained model and its corresponding uBLU algorithm 
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with respect to PCA, NMF, BFRM and GB-GMF for time- 
varying gene expression analysis, using synthetic data with 
known ground truth. We also illustrate the application 
of uBLU to the analysis of a real gene expression dataset 
from a recent viral challenge study [18] in which sev- 
eral subjects were administered viral inoculum and gene 
expression time course data were collected over a period 
of several days. Using these data, we may infer relation- 
ships between genes and symptoms and examine how the 
human host response to viral infection evolves with time. 

Methods 

Mathematical constrained model 

Let Yi represent a gene microarray vector of G gene 
expression levels. The elements of y/ have units of 
hybridization abundance levels with non-negative values. 
In the context of gene expression data, the starting point 
for Bayesian linear unmixing is the linear mixing model 
(LMM) 

R 

Yi = ^ mrar,i + n/, (3) 

r=l 

where R is the number of distinct gene signatures that 
can be present in the chip, = [mi,r> • • • > ^G,r] is the 
r-th gene signature vector, rrig^r > 0 is the strength of 
the ^-th gene {g = 1, . . . , G) in the r-th signature (r = 
1, . . . , 7?), and ar,i is the relative contribution of the r-th 
signature vector to the i-th. sample Yh where arj g[0, 1] 
and Ylr^i ^r,i = 1- Here n/ denotes the residual error of 
the LMM representation. For a matrix of N data samples 
Y = [yi, . . . ,y7v] e R^""^, the LMM can be rewritten 
with matrix notations 

Y = MA + N, (4) 

where M = [mi, . . . , mi?] G R^^^, A = [ai,...,aA^] 
e R^^^ and N = [ni, . . . , n^v] g R^^^ represent the fac- 
tor score matrix, the factor loading matrix and the noise 
matrix, respectively. The matrices M, A satisfy positivity 
and sum-to-one constraints defined by 

m^,r>0, ^r,/>0, and [ 1, . . . , 1] A =[ 1, . . . , 1] , 

(5) 

where rrig^r denotes the (g, r)-th element of matrix M. The 
constraints (5) arise naturally when dealing with positive 
data for which one is seeking the relative contribution of 
positive factors that have the same numerical characteris- 
tics as the data, i.e., the signature m^. is itself interpretable 
as a vector of hybridization abundances. 

The objective of linear unmixing is to simultaneously 
estimate the factor matrix M and the factor score matrix 
A from the available N data samples. The representa- 
tion (1) is rank deficient since A has rank N — 1. This 
presents algorithmic challenges for solving the unmixing 



problem. Several algorithms have been proposed in the 
context of hyperspectral imaging to solve similar prob- 
lems [6,19]. Most of these algorithms perform unmixing 
in a two step procedure where M is estimated first using 
an endmember extraction algorithm (EEA) followed by 
a constrained linear least squares step to estimate A. A 
common (but restrictive) assumption in these algorithms 
is that some samples in the dataset are "pure" in the 
sense that the linear combination of (2) contains a unique 
factor, say m^, with factor score ar,i^ Recently, this assump- 
tion has been relaxed by applying a hierarchical Bayesian 
approach, called Bayesian linear unmixing (BLU). The 
resulting algorithm requires the number R of factors to 
be specified (see [13] for details). Here we extend BLU to 
a fully unsupervised algorithm, called unsupervised BLU 
(uBLU), that estimates R using a birth-death model and 
a Gibbs sampler. The Gibbs sampler produces an esti- 
mate of the entire joint posterior distribution of the model 
parameters, resulting in a fully Bayesian estimator of the 
number of factors R^ the factor loadings M, and the fac- 
tor scores A. The uBLU model is described in the next 
subsection and the Gibbs sampling algorithm is given in 
the Appendix. In the Results and discussion section below 
we demonstrate the performance advantages of uBLU 
as a factor analysis model for simulated and real gene 
expression data. 

Unsupervised Bayesian linear unmixing algorithm 

The BLU algorithm studied in [13] generates samples dis- 
tributed according to the posterior distribution of M and 
A given the number R of factors for appropriate prior 
distributions assigned to the mixing parameters in (2). 
First, the residual errors n/ in (2) are assumed to be 
independent identically distributed (i.i.d.) according to 
zero-mean Gaussian distributions: n^- ~ Af (Oq.g^Ig) for 
/ = 1, . . . , A/^, where Ig denotes the identity matrix of 
dimension G x G. 

The number of factors R to be estimated by the pro- 
posed uBLU algorithm is assigned a discrete uniform prior 
distribution on [ 2, ... , i^max] 

V[R = k]= , for7? = 2,...,7?n.ax, (6) 

where T^max is the maximal number of factors present in 
the mixture. 

Because of the constraints in (5), the data samples y/ 
(/ = 1, . . . , A/') live in a lower-dimensional subspace of R^"^ 
(whose dimension is upper-bounded hyK — 1) denoted as 
Vk-i (^max — 1 ^ ^ G). This subspace can be iden- 
tified by a standard dimension reduction procedure, such 
as PCA. Hence, instead of estimating the factor loadings 
mr G R^ (r = 1, ... ,7?), we propose to estimate their 
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corresponding projections e M/"^ onto this subspace. 
Specifically, these projections can be represented as 



tr = P(m, - y) 



(7) 



where y = ^ Yl^i Yi empirical mean of the data 

matrix Y and P is the (K — 1) x G appropriate projection 
matrix that projects onto V/<:_i, which can be constructed 
from the principal eigenvectors of the empirical covari- 
ance matrix of Y. This dimension reduction procedure 
allows us to work in a lower-dimensional subspace with- 
out loss of information, and reduces significantly the com- 
putational complexity of the Bayes estimator of the factor 
loadings. A multivariate Gaussian distribution (MGD) 
truncated on a subset % is chosen as prior distribution for 
the projected factors t^. The subset % is defined in order 
to ensure the non-negativity constraint on (see [13]) 



treTrO {mg,r > 0, = 1, . . . , G}. 



(8) 



More precisely, % is obtained by noting that = P~^tr+ 
y and by looking for the vectors tr such that all compo- 
nents of P~^tr + y are non-negative. To estimate the mean 
vectors of these truncated MGDs, one can use a stan- 
dard endmember extraction algorithm (EEA) common in 
hyperspectral imaging, e.g. N-FINDR [19]. To summarize, 
the prior distribution for the projected factor tr is 



tr\er,s^ ^Afrr {er.S^lR-i) 



(9) 



where A/7; (e^,5^Ii?_i) denotes the truncated MGD with 
mean vector and covariance matrix s^Ir-i^ with a 
fixed hyperparameter. Assuming the vectors t^, for r = 
1, . . . ,7?, are a priori independent, the prior distribution 
for the projected factor matrix T = [ti, . . . , t;?] is 



/(T|E,s2,i?)af]exp 



r=\ 



l|tr-e,|| 

25? 



ir. (tr) (10) 



where a stands for "proportional to" || • || is the standard 
/2-norm, denotes the indicator function on the set 

A', E = [ei, . . .,qr] and = \s\, . . . 

The sum-to-one constraint for the factor scores a/, for 
each observed sample / (/ = 1, . . . , A/^), allows this vector 

to be rewritten as 



with ^i:R-i,i = [ai^i, . . . , UR-i^iY y 



and UR^i = 1 — X!r=i ^r4' Note here that any compo- 
nent of a/ could be expressed as a function of the others, 
i.e., ar,i = 1 — J2kj^r^k,i' The last component urj has 
been chosen here for notation simplicity. To ensure the 



positivity constraint, the subvectors b.i:r-ij must belong 
to the simplex 



<S = {^i:R-i,i I ||ai:i?_i,/||-^ < land a/ ^ 0}, 



(12) 



where |Mli is the h norm (||a/||i = Ylr^i\^r,i\) and 
a^' >: 0 stands for the set of inequalities {arj > 0}r=i,...,i?. 
Following the model in [13], we propose to assign uni- 
form distributions over the simplex S as priors for the 
subvectors ^i:R-i,i {i = 1, . . . , A/'), i.e., 



(13) 



For the prior distribution on the variance of the 
residual errors we chose a conjugate inverse-Gamma dis- 
tribution with parameters v/2 and y/2 



Mid- 



(14) 



The shape parameter v is a fixed hyperparameter whereas 
the scale parameter y will be adjustable, as in [13]. A non- 
informative Jeffreys' prior is chosen as prior distribution 
for the hyperparameter 7, i.e.. 



/(/) a -%+(/). 

y 



(15) 



The resulting hierarchical structure of the proposed 
uBLU model is summarized in the directed acyclic graph 
(DAG) presented in Additional file 1: Figure SI. 

The model defined in (1) and the Gaussian assumption 
for the noise vectors ni, . . .jIIm allow the likelihood of 
yi> • • • > yA/^ to be determined 



/(Y|0) = 



GN 
2 



exp 



1 Ui 



Ma,- 



2a2 



(16) 



Multiplying this likelihood by the parameter priors 
defined in (10), (13), (14) and (6), and integrating out the 
nuisance parameter y, the posterior distribution of the 
unknown parameter vector 0 = {M,A, or^,7?} can be 
expressed as 



-f 



f(e\Y)= f(e,y\Y)dy 



oc 



jfiY\e)fie\y)fiy)dy, 



(17) 



Considering the parameters to be a priori independent, 
the following result can be obtained 



(11) /Oly) =fiA\R)f{T\B.s\R)f{a^\v.y)fiR) (18) 



where / (A|i?), / (T|E, s^,7?) and f{(y^\v,y) are respec- 
tively the prior distributions of the factor score matrix A, 
the projected factor matrix T and the noise variance 
previously defined. 
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Due to the constraints enforced on the data, the poste- 
rior distribution/ (M, AyR\Y) obtained from the proposed 
hierarchical structure is too complex to derive analytical 
expressions of the Bayesian estimators, e.g., the minimum 
mean square (MMSE) and maximum a posteriori (MAP) 
estimators. In such case, it is natural to use Markov chain 
Monte Carlo (MCMC) methods [20] to generate samples 
M^^\ A^^^ and R^^^ asymptotically distributed according 
to/(M, A,i?|Y). However, the dimensions of the factor 
loading matrix M and the factor score matrix A depend 
on the unknown number R of signatures to be identified. 
As a consequence, sampling from/ (M, A,i?|Y) requires 
exploring parameter spaces of different dimensions. To 
solve this dimension matching problem, we include a 
birth/death process within the MCMC procedure. Specif- 
ically, a birth, death or switch move is chosen at each iter- 
ation of the algorithm (see the Appendix and [21]). This 
birth-death model differs from the classical reversible- 
jump MCMC (RJ-MCMC) (as defined in [21]) in the sense 
that, for the birth-death model, each move is accepted 
or rejected at each iteration using the likelihood ratio 
between the current state and the new state proposed 
by the algorithm. The factor matrix M, the factor score 
matrix A and the noise variance are then updated, 
conditionally upon the number of factors R, using Gibbs 
moves. 

After a sufficient number of iterations (Nmc iterations, 
including a burn-in period of Nbi iterations), the tradi- 
tional Bayesian estimators (e.g., MMSE and MAP) can be 
approximated using the generated samples M^^\ PS^^ and 
R^^^ . First, the generated samples are used to approximate 
the MAP estimator of the number of factors 

^MAP = argmax V[R = k\Y] 

/:€{2,...,i?max} 

N, (19) 
^ argmax — 

/:€{2,...,i?max} 

where Nj^ is the number of generated samples 7?(^bi+i)^ 
satisfying 7?^^ = k and Nr = Nmc - Nm- 
Then, conditioned on i?MAP> the joint MAP estimator 
(Mmap, Amap) of the factor and factor score matrices is 
determined as follows 

(Mmap, Amap) ^ argmax /(m^^\ A^^^|Y,i?= :Rmap) • 

(20) 



sampler algorithm is designed that generates samples dis- 
tributed according to the posterior distribution associated 
to the proposed uBLU model. For more details about the 
Gibbs sampling strategy, see the Appendix. 

Simulations on synthetic data 

To illustrate the performance of the proposed Bayesian 
factor decomposition, we first present simulations con- 
ducted on synthetic data. More extensive simulation 
results are reported in the Additional file 1. 

Simulation scenario 

Several synthetic datasets Pi, . . . , P4 were generated. The 
experiments presented here correspond to the expression 
values of G = 512 genes (for datasets Pi, P3 and P4) 
or G = 12000 genes (for dataset V2) with N = 128 
samples. Each sample is composed of exactly R = 3 
factors mixed using the linear mixing model in (1). The 
factors of the first dataset Vi have been generated so 
that only a few genes affect each factor. For the second 
dataset V2, reaUstic factors have been extracted from real 
genetic datasets. The third dataset D3 has been generated 
enforcing the factors to be orthogonal but not necessar- 
ily positive whereas in the forth dataset, P4, factors are 
orthogonal and positive. These simulation conditions are 
summarized in Table 1. 

In each case, the R = 3 factors were mixed in random 
proportions (factor scores), with positivity and sum-to- 
one constraints. All synthetic datasets were corrupted by 
an i.i.d. Gaussian noise sequence. The signal-to-noise ratio 

II II 2 

is SNR/ = 20 dB where SNR/ = Q-^a'^ Ef=i Mir^r,J 

for each sample / (/ = 1, . . .,N). 
Proposed method (uBLU) 

The first step of the algorithm consists of estimating 
the number of factors R involved in the mixture, and 
hence determining the dimensions of the matrices M 
and A, using the maximum a posteriori (MAP) estimator 
^MAP- The second step of the algorithm consists of esti- 
mating the unknown model parameters (M, A and o^) 
given i^MAP- The estimated posterior distributions of the 
unknown model parameters are given in Additional file 1: 
Figure S5 and vaUdate the proposed Bayesian model. 

The burn-in period and number of Gibbs samples were 
determined using quantitative methods described in the 
Additional file 1: Section "Convergence diagnosis'! 



Results and discussion 

The proposed method consists of estimating simultane- 
ously the matrices M and A defined in (1), under the 
positivity and sum-to-one constraints mentioned previ- 
ously, in a fully unsupervised framework, i.e., the number 
of factors R is also estimated from the data. A Gibbs 



Table 1 Synthetic datasets X>i , . . . , X>4 

P] Peaky factors 

V2 Realistic factors 

Orthogonal factors 
r>4 Orthogonal and positive factors 
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Comparison to other methods 

The performance of the proposed uBLU algorithm is com- 
pared with other existing factor decomposition methods 
including PCA, NMF, BFRM and GB-GMF by using the 
following criteria, which are common measures used to 
compare factor analysis algorithms, 

• the factor mean square errors (MSE) 

r Q \\ r r\\ > 

where ih^ is the estimated r-th factor loading vector, 

• the global MSE of factor scores 

1 ^ 

GMSE^ = - XI (^^'^' - ^^'0^ , r = 1, . . . 

i=l 

where S^,/ is the estimated proportion of the r-th 
factor in the i-th sample, 

• the reconstruction error (RE) 

RE = ^EI|y.-y'll' (21) 

i=l 

where y/ = Ylr=i ^r^r,i is the estimate of y/, 

• the spectral angle distance (SAD) between mr and its 
estimate ih^ for each factor r = 1, ... ,7? 

cAT^ / m^mr \ 

SAD^ = arccos I ■■ ^ ■■ I 

\\mr\ \\mr\\ J 

where arccos(-) is the inverse cosine function, 

• the global spectral angle distance (GSAD) between y/ 
(the i-th observation vector) and y/ (its estimate) 

GSAD =-Y arccos ( „ f „ , 

\\\nhi\\) 

• the computational time. 

The proposed uBLU algorithm, the PCA, NMF and 
GB-GMF methods were implemented in Matlab 7.8.0 
(R2009a). The BFRM software (version 2.0) was down- 
loaded from [22] and implemented with default values 
for the parameters. All methods were implemented on an 
Intel(R) Core(TM)2 Duo processor. 

Simulation results are reported in Tables 2, 3, 4 and 5. 
Note that the positivity and sum-to-one constraints that 
are enforced on the data for the proposed uBLU algorithm 
avoid the scale ambiguity inherent to any factor decom- 
position problem. Conversely, for the other factor decom- 
position methods (PCA, NMF, BFRM and GB-GMF), if 
{M, A} is an admissible solution, {MB, B^A} is also admis- 
sible for any scaling and permutation matrix B. Hence a 
re-scaling is required to identify appropriate permutations 



before computing MSEs and GMSEs. Moreover, when 
PCA, NMF, BFRM and GB-GMF methods are run for 
7? = 4, we only considered the 3 factors yielding the 3 
smallest SADs values. 

These results show that the uBLU method is more flex- 
ible since it provides better unmixing performance across 
all of the considered synthetic datasets Pi, . . . , P4 as com- 
pared to other existing factorization methods (PCA, NMF, 
BFRM and GB-GMF). Moreover, uBLU has the following 
advantages: i) it is fully unsupervised and does not require 
the number of factors to be specified as a prior knowledge, 
ii) due to the constraints, the factors and factor scores 
are estimated without scale ambiguity. The disadvantage 
is the execution time: uBLU requires more computation 
due to the Gibbs sampling. 

Evaluation on gene expression data 

Here the proposed algorithm is illustrated on a real time- 
evolving gene expression data from recent viral challenge 
studies on influenza A/H3N2/Wisconsin. The data are 
available at GEO, accession number GSE30550. 

Details on data collection 

We briefly describe the dataset. For more details the 
reader is referred to [14,18]. H3N2 dataset consists of 
the gene expression levels of N = 267 Affymetrix chips 
collected on 17 healthy human volunteers experimentally 
infected with influenza A/Wisconsin/67/2005 (H3N2). A 
clinical symptom score was assigned to each sample by 
clinicians who participated in the study. Nine of the 17 
subjects (those labeled ZOl, Z05, Z06, Z07, Z08, ZIO, Z12, 
Z13, and Z15 in Figure Ic) became clinically ill during the 
study. These labels are only used as ground truth to quan- 
tify performance and are not available to the uBLU algo- 
rithm. The challenge consists of inoculating intranasally 
a dose of 10^ TCID50 Influenza A manufactured and 
processed under current good manufacturing practices 
(cGMP) by Baxter Bioscience. Peripheral blood microar- 
ray analysis was performed at multiple time instants cor- 
responding to baseline (24 hours prior to inoculation with 
virus), then at 8 hour intervals for the initial 120 hours and 
then 24 hours for two further days. Each sample consisted 
of over G = 12000 gene expression values after stan- 
dard microarray data normalization with RMA using the 
custom brain array cdf [14]. No other preprocessing was 
applied prior to running the five unsupervised methods 
(uBLU, PCA, NMF, BFRM, and GB-GMF). 

Application of the proposed uBLU algorithm 

The uBLU algorithm was run with Nmc = 10000 Monte 
Carlo iterations, including a burn-in period of Nbi = 1000 
iterations. uBLU allows the posterior distribution of the 
number of factors depicted in Figure la, to be esti- 
mated. The results show that the MAP estimate of the 
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Table 2 Simulation results for dataset X>i 



(a) /? = 2 





uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-2) 


0.39 


N/A 


N/A 


205.99 


267.42 




0.60 


6.04 


61.12 


N/A 


N/A 




0.54 


0.97 


9.78 


325.58 


67.14 


GMSE2(x10-3) 


0.04 


N/A 


N/A 


64.39 


226.58 




0.04 


2.00 


2.00 


N/A 


N/A 




0.05 


0.30 


0.28 


75.87 


41.33 


SAD2(x10-^) 


0.46 


N/A 


N/A 


21.69 


12.48 




0.29 


3.49 


3.50 


N/A 


N/A 




0.28 


1.49 


1.50 


23.24 


27.43 


GSAD(x10-2) 


3.39 


20.38 


20.38 


24.04 


37.35 


RE 


0.18 


9.12 


9.12 


1.94 


9.16 


Time (s) 


1.24 X 10^ 


0.03 


0.71 


47.15 


0.39 X 10^ 


(b)/? = 3 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 




0.39 


6.01 


0.48 


212.30 


40.27 


MSE2(x10-2) 


0.60 


6.53 


0.45 


681.42 


147.74 




0.54 


5.86 


0.28 


137.22 


94.90 




0.04 


6.62 


0.19 


76.09 


45.29 


GMSE2(x10-3) 


0.04 


2.40 


0.01 


142.72 


17.37 




0.05 


0.84 


0.05 


76.22 


33.78 




0.46 


1.86 


0.53 


10.68 


11.86 


SAD^(x10-^) 


0.29 


1.18 


0.31 


15.18 


12.50 




0.28 


1.36 


0.26 


5.33 


13.96 


GSAD(x10-2) 


3.37 


3.39 


3.38 


24.23 


33.38 


RE 


0.18 


0.18 


0.18 


1.84 


0.18 


Time (s) 


1.24 X 10^ 


0.10 


0.95 


53.60 


0.56 X 10^ 


(c) /? = 4 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 




0.39 


6.02 


87.78 


205.66 


195.89 


MSE2(x10-2) 


0.60 


6.53 


0.45 


247.96 


101.34 




0.54 


8.03 


0.26 


330.01 


68.69 




0.04 


23.82 


26.56 


64.59 


57.58 


GMSE2(x10-3) 


0.04 


11.70 


0.23 


114.02 


3.10 




0.05 


6.37 


18.04 


75.47 


27.72 




0.46 


1.86 


6.14 


9.74 


8.84 


SAD2(x10-^) 


0.29 


1.18 


0.31 


22.15 


26.80 




0.28 


1.36 


0.26 


8.17 


27.32 


GSAD(x10-2) 


3.39 


3.34 


3.36 


28.62 


29.23 


RE 


0.18 


0.18 


0.18 


2.08 


0.18 


Time (s) 


1.24 X 10^ 


0.11 


0.96 


63.88 


0.70 X 10^ 



MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods. 
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Table 3 Simulation results for dataset X>2 



(a) /? = 2 





uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


mse2 


0.09 


1.97 


N/A 


N/A 


N/A 




0.14 


N/A 


1.06 


37.67 


58.75 




0.14 


0.12 


26.68 


52.09 


1 50.09 


GMSE2(x10-1) 


0.34 


0.01 


N/A 


N/A 


N/A 




0.15 


N/A 


1.12 


1.17 


22.37 




0.09 


0.94 


6.24 


0.62 


1.18 


SAD2(x10-^) 


0.39 


0.44 


N/A 


N/A 


N/A 




0.48 


N/A 


1.32 


16.53 


13.34 




0.47 


0.44 


3.72 


15.21 


18.14 


GSAD(x10-2) 


1.51 


1.02 


1.53 


37.99 


1 29.40 


RE (x 10-2) 


0.64 


1.62 


1.65 


0.65 


5.47 


Time (s) 


22.06 X 10^ 


0.29 


32.02 


4.07 X 10^ 


9.24 X 10^ 


(b)/? = 3 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE^ 


0.09 


1.97 


14.87 


24.41 


61.00 




0.14 


0.01 


20.53 


50.59 


58.31 




0.14 


0.09 


14.02 


35.89 


65.11 


GMSE^(x10-^) 


0.34 


0.03 


0.34 


1.41 


4.80 




0.15 


0.02 


2.44 


0.65 


9.40 




0.09 


0.05 


0.92 


1.19 


5.40 


SAD2(x10-^) 


0.39 


0.44 


2.84 


14.35 


13.72 




0.48 


0.12 


4.75 


15.47 


13.62 




0.47 


0.37 


4.00 


17.50 


15.82 


GSAD(x10-2) 


1.02 


1.02 


1.49 


29.29 


129.29 


RE (x 10-2) 


0.64 


0.63 


1.55 


0.75 


1.62 


Time (s) 


22.06 X 10^ 


0.28 


45.91 


5.37 X 10^ 


16.59 X 10^ 


(c)/? = 4 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE^ 


0.09 


1.97 


13.13 


24.25 


64.90 




0.14 


0.01 


20.53 


50.52 


64.09 




0.14 


0.09 


14.02 


28.32 


69.99 


GMSE2(x10-^) 


0.34 


0.09 


0.20 


1.42 


15.12 




0.15 


0.48 


1.00 


0.65 


9.55 




0.09 


0.05 


0.44 


1.31 


7.73 


SAD2(x10-^) 


0.39 


0.44 


2.54 


14.74 


14.53 




0.48 


0.13 


5.52 


15.45 


14.55 




0.47 


0.37 


4.79 


16.45 


16.17 


GSAD(x10-2) 


1.02 


1.01 


1.06 


40.36 


129.29 


RE (x 10-2) 


0.64 


0.63 


0.69 


0.86 


1.50 


Time (s) 


22.06 X 10^ 


0.54 


55.86 


5.59 X 10^ 


16.59 X 10^ 



MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algoritlim and PCA, NMF, BFRM and GB-GMF methods. 
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Table 4 Simulation results for dataset X>3 



(a) /? = 2 





uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-3) 


0.01 


0.83 


0.82 


N/A 


1.14 




0.85 


0.80 


0.92 


1.34 


2.30 




1.15 


N/A 


N/A 


1.36 


N/A 


GMSE2(x10-2) 


7.75 


7.29 


7.72 


N/A 


8.94 




7.76 


0.47 


0.48 


12.30 


11.86 




9.84 


N/A 


N/A 


11.05 


N/A 


SAD2(x10-^) 


0.59 


7.09 


7.04 


N/A 


15.55 




7.13 


6.71 


7.19 


8.41 


16.43 




8.71 


N/A 


N/A 


8.54 


N/A 


GSAD(x10-^) 


3.23 


2.58 


2.59 


6.59 


15.26 


RE (x 10-4) 


3.11 


0.70 


0.70 


0.47 


2.50 


Time (s) 


1.59 X 10^ 


0.01 


0.70 


42.02 


0.40 X 10^ 


(b)/? = 3 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-3) 


0.01 


0.15 


0.15 


1.74 


1.20 




0.85 


1.02 


0.76 


1.76 


2.26 




1.15 


1.57 


1.03 


1.55 


2.40 


GMSE2(x10-2) 


7.75 


14.89 


2.80 


11.40 


14.09 




7.76 


0.11 


0.40 


12.11 


12.33 




9.84 


0.11 


0.30 


10.94 


12.76 


SAD2(x10-^) 


0.59 


2.60 


2.47 


11.34 


15.76 




7.13 


7.16 


6.59 


945 


16.40 




8.71 


8.80 


7.67 


9.06 


15.66 


GSAD(x10-^) 


3.23 


2.58 


1.71 


6.88 


15.20 


RE (x 10-4) 


3.11 


0.27 


0.29 


0.49 


2.44 


Time (s) 


1.59 X 10^ 


0.10 


1.24 


59.72 


0.54 X 10^ 


(c) /? = 4 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-3) 


0.01 


0.02 


1.43 


1.43 


1.19 




0.85 


1.48 


5.49 


3.92 


2.06 




1.15 


1.68 


0.90 


1.88 


2.33 


GMSE2(x10-2) 


7.75 


13.78 


20.56 


16.66 


13.15 




7.76 


4.35 


12.36 


15.34 


11.75 




9.84 


3.99 


2.67 


11.25 


13.29 


SAD^(x10-^) 


0.59 


0.97 


10.27 


10.24 


15.97 




7.13 


7.93 


15.78 


16.45 


14.92 




8.71 


8.66 


6.93 


10.98 


15.89 


GSAD(x10-^) 


3.23 


1.17 


1.20 


5.51 


15.98 


RE (x 10-4) 


3.11 


0.16 


0.16 


0.41 


2.45 


Time (s) 


1.59 X 10^ 


0.13 


1.15 


67.71 


0.69 X 10^ 



MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods. 
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Table 5 Simulation results for dataset X>4 



(a) /? = 2 





uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-2) 


0.02 


N/A 


5.12 


N/A 


N/A 




1.61 


0.01 


3.59 


15.35 


18.69 




0.05 


0.44 


N/A 


14.42 


19.20 


GMSE2(x10-1) 


0.28 


N/A 


3.23 


N/A 


N/A 




0.87 


0.02 


2.65 


0.33 


1.62 




0.69 


0.76 


N/A 


0.50 


1.30 


SAD2(x10-^) 


0.34 


N/A 


4.25 


N/A 


N/A 




3.08 


0.17 


3.71 


14.90 


14.89 




0.51 


0.68 


N/A 


15.59 


15.70 


GSAD(x10-2) 


4.97 


5.24 


5.25 


157.09 


156.19 


RE (x 10-4) 


4.49 


4.88 


4.89 


19.34 


8.48 


Time (s) 


1.61 X 10^ 


0.02 


1.36 


35.29 


0.40 X 10^ 


(b)/? = 3 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-2) 


0.02 


0.01 


6.18 


18.38 


21.63 




1.61 


0.01 


4.79 


16.10 


19.55 




0.05 


0.09 


4.21 


15.04 


19.85 


GMSE^(x10-^) 


0.28 


0.05 


1.67 


1.44 


1.29 




0.87 


0.05 


1.01 


0.37 


1.75 




0.69 


0.05 


0.94 


0.26 


1.17 


SAD2(x10-^) 


0.34 


0.27 


4.12 


15.21 


15.65 




3.08 


0.17 


4.09 


15.26 


15.90 




0.51 


0.32 


4.16 


16.07 


15.36 


GSAD(x10-2) 


4.97 


4.95 


4.99 


157.08 


1 54.80 


RE (x 10-4) 


4.49 


4.34 


4.36 


25.00 


8.48 


Time (s) 


1.61 X 10^ 


0.10 


1.78 


41.05 


0.55 X 10^ 


(c) /? = 4 




uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


MSE2(x10-2) 


0.02 


0.01 


6.98 


17.51 


21.60 




1.61 


0.01 


7.30 


15.07 


19.03 




0.05 


0.07 


4.27 


14.55 


19.14 


GMSE2(x10-^) 


0.28 


0.22 


0.65 


0.75 


1.29 




0.87 


0.51 


0.91 


0.77 


1.18 




0.69 


0.05 


0.56 


0.56 


1.33 


SAD2(x10-^) 


0.34 


0.27 


4.41 


15.61 


15.51 




3.08 


0.19 


4.81 


16.31 


14.77 




0.51 


0.33 


4.00 


15.84 


15.26 


GSAD(x10-2) 


4.97 


4.91 


4.94 


156.76 


162.63 


RE (x 10-4) 


4.49 


4.30 


4.33 


13.48 


8.29 


Time (s) 


1.61 X 10^ 


0.16 


1.56 


48.22 


0.70 X 10^ 



MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algoritlim and PCA, NMF, BFRM and GB-GMF methods. 
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c 

Figure 1 Experimental results on the H3N2 viral challenge dataset of gene expression profiles, (a) Estimated posterior distribution of tine 
number of factors R. (b) Factor loadings ranl<ed by decreasing dominance, (c) Heatmap of tine factor scores of tine inflammatory component clearly 
separates symptomatic subjects (bottom 9 rows) and the time course of their molecular inflammatory response. The five black colored pixels 
indicate samples that were not assayed. 



number of factors is Rma? = 4 (more than 90% of the gen- 
erated Gibbs samples of the number of factors were equal 
to 4). 

Figure 2 shows the reconstruction error RE^^^ as a func- 
tion of the number of iterations (t = 1, . . .)♦ The recon- 
struction errors are computed from the observed gene 
expression data matrix and the estimates of the factor and 
factor score matrix M and A at each iteration. Figure 2 
also indicates that the number of burn-in and Monte Carlo 
samples Nbi = 1000 and Nmc = 10000 are sufficient. 

The different factors are depicted in Figure lb where 
the G genes have been reordered so that the dominant 
genes are grouped together in each factor. Factors are then 



ordered with respect to their maximum loading. Specifi- 
cally, the k'th sharp peak in the figure occurs at the gene 
index that has maximal loading in factor /c. Genes to the 
right of this dominant gene up to the (k + l)-st peak 
also dominate in this /c-th factor, but at a lower degree. 
uBLU identifies a strong factor (the first factor, in red) by 
virtue of its significantly larger proportion of highly dom- 
inant genes. Many of the genes in this strong factor are 
recognizable as immune response genes that regulate pat- 
tern recognition, interferon, and inflammation pathways 
in respiratory viral response. A very similar factor was 
found in a different analysis [14,18] of this dataset and 
here we call it the "inflammatory component" 
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Figure 2 Reconstruction error and estimated number of factors as a function of the number of iterations (H3N2 challenge data). Top: 
Reconstruction error (RE^^^) computed from the observation matrix Y and tine estimated matrices M*^''^ and A*^^^ as a function of tine iteration index t. 
Bottom: Estimated number of factors R^^^ as a function of tine iteration number t. 



The factor scores corresponding to this inflamma- 
tory component are shown in Figure Ic, where they 
are rendered as an image whose columns (respectively 
rows) index the subjects (respectively the different time 
sampling instants). Figure Ic shows that uBLU clearly 
separates the samples of subjects exhibiting symptoms 
(associated with the last 9 rows) from those who remain 
asymptomatic (associated with the first 8 rows), when 
the estimated number of factors is ^ = 4. Moreover, 
this symptom factor can be used to segment the data 
matrix into 3 states: pre-onset-symptomatic (before sig- 
nificant symptoms occur), post-onset-symptomatic and 
asymptomatic. 



Furthermore, this inflammatory factor identified by the 
proposed uBLU algorithm is most highly represented in 
those samples associated with acute flu symptoms, as 
measured by modified Jackson scores (see [14], Figure IB). 
The dominant gene contributors to this inflamma- 
tory component correspond to well-known transcrip- 
tion factors controlling immune response, inflammatory 
response and antigen presentation - see Table 6. The 
reader is referred to [14,18] for more details on clinical 
determination of symptom scores and biological signifi- 
cance of the inflammatory component genes. 

For comparison we applied a supervised version of the 
proposed uBLU algorithm to the H3N2 dataset. This was 



Table 6 NCI-curated pathway associations of group of genes contributing to uBLU inflammatory component 



Pathway name 


Genes 


P-value 


IFN-gamma patliway 


CASPl, CEBPB, ILIB, IRFl, IRF9, PRKCD, SOCSl, STATl, 
STATS 


1 .S4e-09 


PDGFR-beta signaling patliway 


D0CK4, EIF2AK2, FYN, HCK, LYN, PRKCD, SLA, SRC, 
STATl , STATS, STAT5A, STAT5B 


S.26e-08 


IL23-mediated signaling events 


CCL2, CXCLl , CXCL9, 111 B, STATl , STATS, STAT5A 


2.18e-07 


Signaling events mediated byTCPTP 


EIF2AK2, SRC, STATl, STATS, STAT5A, STAT5B, STAT6 


6.S8e-07 


Signaling events mediated by PTPl B 


FYN, HCK, LYN, SRC, STATS, STAT5A, STAT5B 


2.40e-06 


GMCSF-mediated signaling events 


CCL2, LYN, STATl , STATS, STATS A, STATS B 


S.70e-06 


IL12-mediated signaling events 


HLA-A, ILl B, SOCSl , STATl , STATS, STATSA, STAT6 


1 .S2e-0S 


IL6-mediated signaling events 


CEBPB, HCK, IRFl , PRKCD, STATl , STATS 


1 .80e-0S 



NCI-curated pathway associations of group of genes contributing to uBLU inflammatory component, whose factor scores are shown in Figure 1 (Source: NCI pathway 
interaction database http://pid.nci.nih.gov). Genes in uBLU factor are significantly better represented in the NCI-curated pathways than the genes in NMF (compare 
p-values here to those in Table 8). 
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implemented by setting the number of factors to R = 4 
and using the algorithm [13] to jointly estimate M and 
A. The inflammatory component found by the super- 
vised algorithm was virtually identical to the one found by 
the proposed algorithm (uBLU) that automatically selects 
7^ = 4. 

Comparison to other methods 

The uBLU algorithm is compared with four matrix factor- 
ization algorithms, i.e. PCA, NMF, BFRM and GB-GMF 
methods. 

Figure 3 depicts the different factors, ordered so that 
the inflammatory group is the leftmost one (in red). 
The factor loadings obtained with NMF or PCA reveal 
the inflammatory component. However, there are fewer 



highly dominant genes in the NMF and PCA loadings for 
this factor as compared to uBLU. The BFRM and GB-GMF 
methods found four pathways, several overlapping with 
those of uBLU, NMF and PCA. 

The factor scores of the five matrix factorization meth- 
ods corresponding to the inflammatory component are 
depicted in Figure 4. This figure shows that the uBLU and 
the NMF methods are better able to attain a high con- 
trast separation between the acutely symptomatic samples 
and the other samples. This is confirmed by the evalua- 
tion of the Fisher criteria (22) between these two regions 
(see Table 7). Indeed, denote by (/Xpos> cr^pos) the empirical 



mean and variance of the scores associated with the Nr 



pos 



samples in the acute symptomatic state (bright colored 
samples in the lower right rectangle of Figure Ic). Denote 
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Figure 3 Factor loadings ranlced by decreasing dominance for H3N2 chiallenge data. uBLU shows a particularly strong component (Figure 1 b), 
the group ftl, that corresponds to the well-known inflammatory pathway. NMF and PCA algorithms also reveal an inflammatory component, but it 
includes fewer relevant genes than uBLU. See Figure 4 for the corresponding factor scores. 
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Figure 4 Heatmaps of the factor scores of the inflammatory component for H3N2 challenge data. The inflammatory factor determined by 
tine proposed uBLU metliod (a) sliows liiglier contrast between symptomatic and asymptomatic subjects tlian tine otiier metliods. Tine five blacl< 
colored pixels of the heatmaps indicate samples that were not assayed. 
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Table 7 Simulation results for real H3N2 dataset 





uBLU 


PCA 


NMF 


BFRM 


GB-GMF 


Fisher criteria (x10~2) (22) 


6.20 


2.03 


6.17 


4.68 


2.30 


RE 


64810-2 


4.89 


7.31.10-2 


4.82 


9.51.10-2 


Time 


^]2h 


1.5s 


116s 


^ 47 min 


^]0h 


Number of iterations 


10 000 


N/A 


5 000 


10 000 


500 



Measure of the Fisher linear discriminant measure ([23], p. 11 9) between post-onset symptomatic samples and the other samples on heatmaps (Figure 4), 
reconstruction error (RE) between the observed data and the MAP estimators, computational times (for an implementation in MATLAB 7.8.0 (R2009a) on a 3 GHz 
Intel(R) Core(TM)2 Duo processor), and corresponding number of iterations. 



by (mpos) <^^pos) the same parameters for the remaining 
samples. The Fisher linear discriminant measure ([23], 
p. 119) is defined as 

(Mpos - Mpos) ^22^ 

A^posO^^pos + (N - A/^pos)o^^pos * 

To compare the biological relevance of the inflamma- 
tory genes found by uBLU to those found by the other 
methods we performed gene enrichment analysis (GEA). 
Here we only report GEA comparisons between uBLU 
and NMF. Tables 6 and 8 show the pathway enrich- 
ment associated with the top 200 genes found by uBLU 
and NMF, respectively, using NCI pathway interaction 
database (http://pid.nci.nih.gov). The uBLU genes are sig- 
nificantly better associated with the NCI-curated path- 
ways than the NMF genes. In particular, the two most 
enriched pathways, IFN-gamma and PDGFR beta signal- 
ing, associated with the uBLU genes have much higher 
statistical significance (lower p-value) than any of the 
pathways associated with NMF. 

Figure 5 shows how the factor scores of the dominant 
factor can be used as features to cluster samples. Euclidean 
multidimensional scaling (MDS) [24] is used to map the 
factor score vector for each sample as a coordinate in the 
plane. Each sample is embedded with a color and a size. 



denoting the state of the subject (asymptomatic subjects 
in blue, symptomatic subjects in red) and the time stamp, 
respectively. These figures show that uBLU can separate 
sick and healthy subjects, as well as or better than NMF, 
BFRM and GB-GME 

One can conclude from these comparisons that, when 
applied on the H3N2 dataset, the proposed uBLU algo- 
rithm outperforms PCA, NMF, BFRM, and GB-GMF 
algorithms in terms of finding genes with higher path- 
way enrichment and achieving higher contrast of the acute 
symptom states. 

The computational times required by the five consid- 
ered matrix factorization methods, including the pro- 
posed uBLU algorithm, when applied to this real dataset, 
are reported in Table 7. The GB-GMF algorithm is slightly 
faster than the proposed algorithm but does not identify 
the inflammatory component or achieve good contrast of 
the acute symptom states in the H3N2 challenge study. 

Conclusions 

This paper proposes a new Bayesian unmixing algorithm 
for discovering signatures in high dimensional biologi- 
cal data, and specifically for gene expression microarrays. 
An interesting property of the proposed algorithm is that 
it provides positive factor loadings to ensure positivity 
as well as sum-to-one constraints for the factor scores. 



Table 8 NCI-curated pathway associations of group of genes contributing to NMF inflammatory component 



Pathway name 


Genes 


P-value 


IL23-mediated signaling events 


CCL2, CXCLl , CXCL9, ILl B, JAK2, STATl , STAT5A 


2.18e-07 


IL12-nnediated signaling events 


GADD45B, IL1B, JAK2, MAP2K6, SOCSl, STATl, 
STAT5A, STAT6 


1.10e-06 


IFN-gamma pathway 


CASPl, ILl B, IRF9, JAK2, SOCSl, STATl 


1 .07e-05 


Signaling events mediated byTCPTP 


EIF2AK2, PIK3R2, STATl , STAT5A, STAT5B, STAT6 


1 .07e-05 


IL27-mediated signaling events 


ILl B, JAK2, STATl , STAT2, STAT5A 


1 .22e-0S 


CXCR3-mediated signaling events 


CXCLl a CXCLl 1 , CXCLl 3, CXCL9, MAP2K6, PIK3R2 


1 .23e-0S 


GMCSF-mediated signaling events 


CCL2, JAK2, STATl , STATS A, STATS B 


6.24e-0S 


PDGFR-beta signaling pathway 


EIF2AK2, JAK2, PIK3R2, ARAP1 , D0CK4, STATl , STATS A, 
STATS B 


1 .38e-04 



NCI-curated pathway associations of group of genes contributing to NMF inflammatory component, whose factor scores are shown in Figure 4 (Source: NCI pathway 
interaction database http://pid.nci.nih.gov). Genes in uBLU factor are significantly better represented in the NCI-curated pathways than the genes in NMF (compare 
p-values here to those in Table 6). 
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0.5 



-1 0 
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e GB-GMF 

Figure 5 Chip clouds after demixing for H3N2 chiallenge data. These figures show the scatter of the four dimensional factor score vectors 
(projected onto the plane using MDS) for each algorithm that was compared to uBLU. uBLU, NMF and BFRM obtain a clean separation of samples of 
symptomatic (red points) and asymptomatic (blue points) subjects whereas the separation is less clear with PCA. In these scatter plots the size of a 
point is proportional to the time at which the sample was taken during challenge study. 
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The advantages of these constraints are that they lead 
to better discrimination between sick and healthy indi- 
viduals, and they recover the inflammatory genes in a 
unique factor, the inflammatory component. The pro- 
posed algorithm is fully unsupervised in the sense that 
it does not depend on any labeling of the samples and 
that it can infer the number of factors directly from the 
observation data matrix. Finally, as any Bayesian algo- 
rithm, the Monte Carlo-based procedure investigated in 
this study provides point estimates as well as confidence 
intervals for the unknown parameters, contrary to many 
existing factor decomposition methods such as PCA or 
NMF. 

Simulation results performed on synthetic and real data 
demonstrated significant improvements. Indeed, when 
applied to real time-evolving gene expression datasets, 
the uBLU algorithm revealed an inflammatory factor with 
higher contrast between subjects who would become 
symptomatic from those who would remain asymp- 
tomatic (as determined by comparing to ground truth 
clinical labels). 

In this study, the time samples were modeled as inde- 
pendent. Future works include extensions of the proposed 
model to account for time dependency between samples. 

Appendix A: Gibbs sampler 

This appendix provides more details about the Gibbs sam- 
pler strategy to generate samples {M^^\A^^\a^^^\R^^'^} 
distributed according to the joint distribution 
/ (M, A, a^,7?|Y) (the reader is referred to [25] for more 
details about the Gibbs sampler and MCMC methods). 
This joint distribution can be obtained by integrating out 
the hyperparameter y from/ (0, y | Y) defined in (18) and 
can be written 

/(M,A,a2,7?|Y) oc/(Y|M,A,a2,7?) 

xf(A\R) 

where the dimensions of the matrices M, T, and A depend 
on the unknown number of factors R and the priors have 
been defined in the Section "Methods'! 

The different steps of the Gibbs sampler are detailed 
below. 

Inference of the number of factors 

The proposed unsupervised algorithm includes a 
birth/death process for inferring the number of factors i?, 
i.e., it generates samples R in addition to M and A. More 
precisely, at iteration t of the algorithm, a birth, death 



or switch move is randomly chosen with probabilities 
bj^(t)y dj^{t) and Sj^n). The birth and death moves con- 
sist of increasing or decreasing by 1 the number R of 
factors using a reversible jump step (see [21] for more 
details), whereas the switch move does not change the 
dimension of R and requires the use of a Metropolis- 
Hastings acceptance procedure. Let consider a move, 
at iteration index U from the state {M^^\ A!'^\R^^^} to 
the new state {M*,A*,i?*}. The birth, death and switch 
moves are defined as follows, similar to those used in [26] 
(Algorithms 3, 4 and 5). 

• Birth move: When a birth move is proposed, a new 
signature m* is randomly generated to build 

= [M^^),m*].The new corresponding space is 
checked so that the signatures are sufficiently distinct 
and separate from one another. Then, a new factor 
score coefficient is drawn, for each vector a/ 
(/ = 1, . . . ,Ar), from a Beta distribution B (l,7?^^^), 
and the new factor score matrix, denoted as A*, is 
re-scaled to sum to one. 

• Death move: When a death move is proposed, one 
of the factors of M^^\ and its corresponding factor 
score coefficients, are randomly removed. The 
remaining factor scores are re-scaled to ensure the 
sum-to-one constraint. 

• Switch move: When a switch move is proposed, a 
signature m* is randomly chosen and replaced with 
another signature randomly generated. If the new 
signature is too close to another, its corresponding 
factor scores are proportionately distributed among 
its closest factors. Indeed, the switch move consists of 
creating a new signature (birth move) and deleting 
another one (death move) in a faster single step. 

Each move is then accepted or rejected according to 
an empirical acceptance probability: the likelihood ratio 
between the actual state and the proposed new state. 
The factor matrix M, the factor score matrix A and 
the noise variance are then updated, conditionally 
upon the number of factors R, using the following Gibbs 
steps. 

Generation of samples according to f (T| A, a^, R, Y) 

Sampling from the joint conditional /(T| A, a ^,i?,Y) is 
achieved by updating each column of T using Gibbs 
moves. Let denote Ty the matrix T whose r-th column 
has been removed. The posterior distribution of tr is the 
following truncated multivariate Gaussian distribution 
(MGD) 

t,|Ty, a„ or^ Y - Mrr (Tr, r,) (24) 
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where 



Ir 



E 



PS 



ar,iVtVj. 



(25) 



For more details on how we generate reaUzations from this 
truncated distribution, see [13]. 

Generation of samples according to f (ai:/?_i,/|T,(7^,/?, Y) 

Straightforward computations lead to the posterior distri- 
bution of each element of ^\:R-\^i 



/(ai:7?-i,/|T,or2,i?,Y) aexp 



I5 (ai:7?-l,i) 



where 

i:i:R-u = [MlR^-m\RY\ 



M\R = M 



\R 



^R'^R-V 



(26) 



(27) 



= [1, . . . , 1] e ^ and M\r denotes the factor 
loading matrix M whose R-th column has been removed. 





-12 0 5 1221 529 3645.553 6069 577 8493.5101108 -12 0 5 1221 529 3645.553 6069 577 8493.5101108 



Time index Time index 

C When only the sum-to-one constraint is enforced d When all the constraints (positivity and sum-to-one) are 

enforced 

Figure 6 Contribution of each constraint on the scores of the inflammatory factor (H3N2 challenge data). The five black colored pixels of the 
heatmaps indicate samples that were not assayed. Note that when only the sum-to-one constraint is applied, non-negativity is not guaranteed. 
However, for this dataset the sum-to-one factor scores turn out to take on non-negative values for the inflammatory factor (but not for the other 
factors). 
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Table 9 Contribution of each of uBLU's constraints 





Without 
constraints 


Positivity 


Sum-to-one 


Positivity and 
sum-to-one 


P-value of the "IFN-gamma pathway" 


6.00.10-2 


2.05.10-2 


2.17.10-^ 


1.34 10-^ 


P-value of the "IL23-mediated signaling events" 


2.60.10-^ 


8.37.10-2 


2.28.10-2 


2.1810-^ 



Benefit of constraints in uBLU in terms of gene enrichment in the NCI-curated IFN-gamma and IL23-mediated pathways. As in Tables 6 and 8, the top 200 genes in the 
inflammatory components, whose scores are shown in Figures 6(a-d), were analyzed using the NCI Pathway Interaction Database. Both positivity and sum-to-one 
constraints are necessary for uBLU to reveal these two pathways with the high significance (p-value less than 1 0~^). 



Equation (26) shows that the factor score distribution is 
an MGD truncated on the simplex S defined in (12). 

Generation of samples according to f (a^ |T, A, R, Y) 

Using (14) and (16), one can show that the conditional dis- 
tribution /(a ^|M, A, Y) is the following inverse-Gamma 
distribution 

/gn 1 ^ \ 
or^lM, A, Y ~ ig 1^—, - J2 hi - Ma/ll'j . (28) 

Appendix B: Contribution of each of uBLU's 
constraints 

To illustrate the advantage of enforcing non-negativity 
and sum-to-one constraints on the factors and on the 
factor scores, as detailed in the Methods section, we eval- 
uated the effect of successively stripping out these con- 
straints from uBLU. In particular we implemented uBLU 
under the following conditions: i) without any constraints, 
ii) with only the positivity constraints on the factors and 
the scores, iii) with only the sum-to-one constraint on 
the scores, and iv) with both positivity and sum-to-one 
constraint on factors and scores as proposed in (5). 

Figures 6 display heatmaps of the factor scores of the 
inflammatory component. The segmentation into two 
main regions (post-symptomatic samples and asymp- 
tomatic samples) becomes apparent only when the sum- 
to-one constraint is enforced on the scores. To quantify 
the benefit that is visible in Figure 6 we performed a GEA 
analysis, reported in Table 9, on the top 200 genes found 
in each of the inflammatory components found by uBLU 
implemented with no constraints, positivity constraints, 
sum-to-one constraints, and both constraints. The table 
shows that both constraints are necessary to obtain the 
best enrichment scores (lowest possible p-values). 

Additional file 

f \ 

Additional file 1 : Supplementary materials on algorithm details and 
performance validation. Directed acyclic graph (DAG) of the model and 
flowchart of the proposed algorithm are provided in this additional file. 
More results on synthetic datasets are also presented to validate the 
proposed Bayesian algorithm, including a convergence diagnosis. 
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